Unstructured Data Support

erwin Data Intelligence (erwin DI) supports unstructured data, enabling your organization to ingest, analyze, and govern files that do not follow a predefined schema. This capability extends metadata management to documents and media files, bringing previously unmanaged content into the erwin DI governance framework.

erwin DI supports a wide range of unstructured file formats, including the following:

  • Document Formats:
    PDF, Word (DOC/DOCX), Excel (XLS/XLSX), Text (TXT), RDF, PPT, ODP, HTML, ODT, and Markdown.
  • Media/Binary Formats:
    PNG, JPG/JPEG, GIF, BMP, WebP, and TIFF.
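
As a rough illustration of the two groups above, the following sketch sorts file names into document or media formats by extension. The extension spellings (for example, .md for Markdown) and the function name classify_file are assumptions made for this example; they do not describe how erwin DI identifies file types internally.

```python
from pathlib import PurePath

# Extension groups derived from the format list above. The exact
# extension spellings are assumptions for illustration only.
DOCUMENT_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".txt", ".rdf",
    ".ppt", ".odp", ".html", ".odt", ".md",
}
MEDIA_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".tiff"}


def classify_file(name: str) -> str:
    """Return 'document', 'media', or 'unsupported' for a file name."""
    suffix = PurePath(name).suffix.lower()  # case-insensitive comparison
    if suffix in DOCUMENT_EXTENSIONS:
        return "document"
    if suffix in MEDIA_EXTENSIONS:
        return "media"
    return "unsupported"
```

A check like this can help you review a folder before pointing a scan at it.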

With enhanced unstructured data support, such files are ingested and their content is collected for processing. During processing, erwin DI applies Optical Character Recognition (OCR) to extract text from image-based files and uses AI-driven techniques such as Natural Language Processing (NLP) and pattern recognition to interpret and classify content. The extracted information is then transformed into governed metadata assets, represented as tables, columns, and attributes within the catalog. Additionally, Sensitive Data Identification (SDI) is applied automatically to detect and tag sensitive information.
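
To make the idea of pattern-based detection concrete, the following minimal sketch tags text that has already been extracted from a file. The two patterns and the function name tag_sensitive are hypothetical and exist only for this example; they are not the rules that SDI applies in erwin DI.

```python
import re

# Hypothetical patterns for illustration only. They are NOT the
# detection rules used by Sensitive Data Identification in erwin DI.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def tag_sensitive(text: str) -> dict[str, list[str]]:
    """Return the matches found in text, keyed by pattern label.

    Labels with no match are left out of the result.
    """
    tags = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            tags[label] = matches
    return tags
```

In this sketch, a label appears in the result only when its pattern matches, which mirrors the idea of tagging only the content that contains sensitive information.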

Files up to 1 GB can be processed directly through the UI. Larger files are best handled using Scheduled Scans to minimize performance impact.
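
The size guidance above can be expressed as a simple check. In this sketch, treating 1 GB as 1024**3 bytes and the names ingestion_route and route_for_path are assumptions made for illustration; confirm the exact limit in your environment.

```python
import os

# 1 GB limit from the guidance above. Interpreting it as 1024**3 bytes
# is an assumption made for this example.
UI_SIZE_LIMIT_BYTES = 1024 ** 3


def ingestion_route(size_bytes: int) -> str:
    """Suggest 'ui' for files up to the limit, else 'scheduled_scan'."""
    if size_bytes <= UI_SIZE_LIMIT_BYTES:
        return "ui"
    return "scheduled_scan"


def route_for_path(path: str) -> str:
    """Read the file size from disk and suggest a route for that file."""
    return ingestion_route(os.path.getsize(path))
```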

Once published, the metadata is available for search, governance, and analysis.

Profile Unstructured Data

To profile data for automated ingestion, follow these steps:

  1. Click New Environment.
  2. The New Environment page appears and displays the supported databases in the Datasources tab.
  3. Select Other.

    Alternatively, enter a keyword in the search bar to search for datasources.

  4. The Configuration Details tab appears and displays connection details for Other datasources. The connection details vary based on the database selection.

    Enter appropriate values in the fields. Fields marked with a red asterisk are mandatory.

  5. Switch to the Connection Properties tab and set the connection properties as follows.

    Field Name     Description
    -----------    -----------
    Driver Name    Specifies the JDBC driver name for connecting to the database.
                   For example, com.quest.erwin.jdbc.OpenJDBCDriver

    URL            Specifies the full JDBC URL that represents the location from
                   which unstructured data files are read. All files located
                   within this path and its subfolders are considered for scanning.
                   This field requires a network drive path that the Tomcat
                   service has access to.
                   For example, jdbc:openjdbc:C:\XYZ

  6. Save the connection.

erwin DI can then ingest and scan all files in the configured path and its subfolders.
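
Before saving the connection, it can be useful to preview which files fall under the configured path. The following sketch builds a URL in the same form as the example above and lists every file under a folder, including subfolders. The function names build_url and files_in_scope are hypothetical, and the script only approximates the scan scope; it does not call erwin DI.

```python
import os

# Prefix copied from the example URL in the connection properties.
JDBC_PREFIX = "jdbc:openjdbc:"


def build_url(root_path: str) -> str:
    """Combine the prefix and a folder path into a URL string."""
    return JDBC_PREFIX + root_path


def files_in_scope(root_path: str) -> list[str]:
    """List every file under root_path, including all subfolders."""
    found = []
    for folder, _subfolders, names in os.walk(root_path):
        for name in names:
            found.append(os.path.join(folder, name))
    return sorted(found)
```

Run the listing from an account with the same access as the Tomcat service, so the preview reflects what the service can actually read.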